Add data preparation utilities for easier onboarding by igerber · Pull Request #10 · igerber/diff-diff

igerber · 2026-01-03T12:38:05Z

Add a new prep.py module with utility functions to help users prepare
their data for DiD analysis:

generate_did_data: Create synthetic data with known treatment effect
make_treatment_indicator: Convert categorical/numeric to binary treatment
make_post_indicator: Create post-treatment indicator from time columns
wide_to_long: Reshape wide panel data to long format
balance_panel: Balance unbalanced panel data
validate_did_data: Check data meets DiD requirements with helpful errors
summarize_did_data: Get summary statistics by treatment-time cells
create_event_time: Create event-time for staggered adoption designs
aggregate_to_cohorts: Aggregate unit data to cohort means

Includes comprehensive tests and README documentation.
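
As a rough illustration of why synthetic data with a known treatment effect helps onboarding, the sketch below builds a two-period panel and recovers the effect from the four treatment-by-period cell means. It does not use the actual prep.py API: `make_synthetic_did`, `did_from_cell_means`, and all column names are hypothetical stand-ins written in plain pandas.

```python
import numpy as np
import pandas as pd


def make_synthetic_did(n_units=400, effect=3.0, seed=0):
    """Hypothetical helper (not the prep.py API): two-period panel
    whose true treatment effect is `effect`."""
    rng = np.random.default_rng(seed)
    units = np.arange(n_units)
    treated = (units < n_units // 2).astype(int)  # first half is treated
    frames = []
    for post in (0, 1):
        outcome = (
            1.0
            + 2.0 * treated            # level difference between groups
            + 0.5 * post               # common time trend
            + effect * treated * post  # the effect we want to recover
            + rng.normal(0.0, 1.0, size=n_units)
        )
        frames.append(
            pd.DataFrame(
                {"unit": units, "post": post, "treated": treated, "outcome": outcome}
            )
        )
    return pd.concat(frames, ignore_index=True)


def did_from_cell_means(df):
    """Classic 2x2 DiD: (treated post - pre) minus (control post - pre)."""
    means = df.groupby(["treated", "post"])["outcome"].mean()
    treated_change = means.loc[(1, 1)] - means.loc[(1, 0)]
    control_change = means.loc[(0, 1)] - means.loc[(0, 0)]
    return treated_change - control_change


panel = make_synthetic_did()
estimate = did_from_cell_means(panel)
print(f"DiD estimate: {estimate:.2f} (true effect 3.0)")
```

Because the true effect is fixed by construction, any estimator run on such data can be checked against it.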


- Add Optional type hint to cluster parameter in DifferenceInDifferences
- Use significance_stars property in DiDResults.__repr__ instead of inline logic
- Add clarifying comments for vcov computation using solve()
- Import LinAlgError directly for cleaner exception handling

Review of "Add data preparation utilities for easier onboarding" with:
- 9 specific issues identified (4 medium, 5 low priority)
- Performance issue: iterrows() in wide_to_long should use pd.melt()
- Bug: groupby().ffill() doesn't work as expected in balance_panel
- Logic issue: multi-period validation incomplete
- Missing test coverage for ffill path
- Overall positive assessment with requested changes
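
The pd.melt() suggestion in the review can be sketched as follows. This is a minimal standalone example, not the wide_to_long implementation from this PR; the column names (`y_2019`, `y_2020`, `period`, `outcome`) are assumptions made for illustration.

```python
import pandas as pd

# Wide panel: one row per unit, one outcome column per period.
wide = pd.DataFrame(
    {
        "unit": ["a", "b"],
        "y_2019": [1.0, 2.0],
        "y_2020": [1.5, 2.5],
    }
)

# A single vectorized melt replaces a Python-level loop over rows.
long = wide.melt(
    id_vars="unit",
    value_vars=["y_2019", "y_2020"],
    var_name="period",
    value_name="outcome",
)

# Recover the numeric period from the original column name.
long["period"] = long["period"].str.replace("y_", "", regex=False).astype(int)
long = long.sort_values(["unit", "period"]).reset_index(drop=True)
print(long)
```

The reshape runs in vectorized pandas code, which is generally much faster than building rows one at a time with iterrows().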

Both PR #9 and PR #10 review feedback has been addressed:

PR #9 fixes verified:
- Optional type hint added to cluster parameter
- significance_stars property used in __repr__
- Clarifying comments added for vcov computation
- LinAlgError imported directly

PR #10 fixes verified:
- iterrows() replaced with pd.melt() in wide_to_long
- groupby().ffill() bug fixed in balance_panel
- Multi-period validation logic corrected
- Index labeling fixed for non-0/1 time values
- np.isinf() guard added for non-numeric types
- Test added for balance_panel ffill path
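
One way to balance a panel with forward fill, shown here only as a sketch and not as the balance_panel code from this PR, is to reindex onto the full unit-by-period grid first and then forward-fill within each unit. The column names are hypothetical.

```python
import pandas as pd

# Unbalanced panel: B is missing period 1, C is missing period 0.
panel = pd.DataFrame(
    {
        "unit": ["A", "A", "A", "B", "B", "C", "C"],
        "period": [0, 1, 2, 0, 2, 1, 2],
        "y": [1.0, 2.0, 3.0, 10.0, 30.0, 200.0, 300.0],
    }
)

# Build the complete unit x period grid so missing rows exist as NaN.
full_index = pd.MultiIndex.from_product(
    [panel["unit"].unique(), sorted(panel["period"].unique())],
    names=["unit", "period"],
)
balanced = panel.set_index(["unit", "period"]).reindex(full_index)

# Forward-fill within each unit only; the result keeps the original index,
# so it can be assigned straight back to the column.
balanced["y"] = balanced.groupby(level="unit")["y"].ffill()
balanced = balanced.reset_index()
print(balanced)
```

Grouping by unit before filling keeps values from leaking across units: B's gap at period 1 is filled from B's own period 0, while C's missing period 0 stays NaN because C has no earlier observation.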

- Add detailed explanation in Section 14 of why pre-treatment effects differ between Callaway-Sant'Anna (varying base) and Sun-Abraham (fixed reference period e=-1), while post-treatment effects match - Enhance comparison code to show CS with both base_period options - Add point #10 to tutorial summary documenting expected behavior - Add test documenting this methodological difference - Update REGISTRY.md with cross-reference note Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
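
The base-period difference described in this commit can be illustrated with toy arithmetic. The numbers below are made up, and the two functions are simplified stand-ins for the comparison each approach makes, not the estimators themselves: under a varying base, a pre-treatment estimate at event time e compares against e-1, whereas a fixed reference always compares against e=-1.

```python
# Hypothetical treated-minus-comparison outcome gaps by event time e.
gap = {-3: 1.0, -2: 1.5, -1: 1.2, 0: 3.2, 1: 4.2}


def varying_base(e):
    """Pre-periods use the immediately preceding period as the base;
    post-periods use e = -1."""
    base = e - 1 if e < 0 else -1
    return gap[e] - gap[base]


def fixed_reference(e):
    """Every period is compared against the reference period e = -1."""
    return gap[e] - gap[-1]


# e = -3 has no earlier period in this toy data, so it is skipped
# under the varying base.
varying = {e: varying_base(e) for e in gap if e >= 0 or (e - 1) in gap}
fixed = {e: fixed_reference(e) for e in gap}

for e in sorted(gap):
    v = varying.get(e)
    v_text = "n/a" if v is None else f"{v:+.1f}"
    print(f"e={e:+d}  varying base: {v_text:>5}  fixed reference: {fixed[e]:+.1f}")
```

In this toy setup the post-treatment rows agree because both schemes use e=-1 as the base there, while the pre-treatment rows differ because they answer different questions (period-to-period change versus distance from the reference period).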

…StageDiD Addresses axis-C findings #8, #9, and #10 from the silent-failures audit: three sites where a sparse factorization failure silently fell back to dense lstsq without any user-facing signal. - diff_diff/imputation.py:1516 (variance path: scipy.sparse.linalg.spsolve on (A_0' W A_0) z = A_1' w). Bare `except Exception` was swallowing the root cause before dense lstsq. Now emits a UserWarning identifying the exception type and explaining the fallback implication. - diff_diff/two_stage.py:1647 (GMM sandwich: sparse_factorized on X'_{10} W X_{10} for Stage 1 normal equations). `except RuntimeError` was silent; now emits a UserWarning. - diff_diff/two_stage_bootstrap.py:134 (bootstrap path: same pattern as above). `except RuntimeError` was silent; now emits a UserWarning. All three are single-call sites (per fit, or per aggregation level, or per bootstrap replicate at most a handful of times) so no aggregation wrapper pattern is needed — one warning per fallback event is appropriate. REGISTRY.md updated under ImputationDiD and TwoStageDiD. New tests (3): monkey-patch the sparse entry point to raise a RuntimeError, run .fit(), assert the UserWarning fires with the expected message prefix. Works against both the variance and bootstrap surfaces. Axis-C baseline: 3 major silent-fallback sites (imputation, two_stage, two_stage_bootstrap) -> 0 remaining in these files. PowerAnalysis simulation counter (finding #11) and ContinuousDiD B-spline (#12) still open as separate follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
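
The warn-then-fall-back pattern described in this commit might look roughly like the sketch below. It is not the code in imputation.py or two_stage.py: `solve_with_fallback` and its `sparse_solver` parameter are hypothetical names, and the injectable solver exists only so a failure can be simulated, in the same spirit as the monkey-patching tests the commit mentions.

```python
import warnings

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import factorized


def solve_with_fallback(A, b, sparse_solver=None):
    """Solve A x = b with a sparse factorization; on failure, warn and
    fall back to dense least squares. Hypothetical helper for illustration."""
    if sparse_solver is None:
        # Default: factorize the CSC form of A, then solve for b.
        def sparse_solver(A_, b_):
            return factorized(A_.tocsc())(b_)

    try:
        return sparse_solver(A, b)
    except RuntimeError as exc:
        # Surface the root cause instead of silently changing solvers.
        warnings.warn(
            f"Sparse factorization failed ({type(exc).__name__}: {exc}); "
            "falling back to dense lstsq.",
            UserWarning,
            stacklevel=2,
        )
        x, *_ = np.linalg.lstsq(A.toarray(), b, rcond=None)
        return x


A = sparse.csc_matrix(np.array([[4.0, 1.0], [1.0, 3.0]]))
b = np.array([1.0, 2.0])
x = solve_with_fallback(A, b)
```

Passing a solver that raises RuntimeError exercises the fallback branch and lets a test assert that the UserWarning fires.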

Lift the gate at chaisemartin_dhaultfoeuille.py:1233-1239 so per-path event-study effects compose with survey_design under analytical Binder TSL SE and replicate-weight bootstrap variance. Multiplier bootstrap (n_bootstrap > 0) under survey + by_path remains gated; the survey-aware perturbation pivot for path-restricted IFs is methodologically underived and deferred to a future wave. Per-path SE routes through the existing _survey_se_from_group_if cell-period allocator. The per-period IF (U_pp_l_path) with non-path switcher contributions zeroed at both group and cell levels (the row-sum identity U_pp.sum(axis=1) == U is preserved trivially under group-level zeroing) is cohort-recentered via _cohort_recenter_per_period, then expanded to observations as psi_i = U_pp[g_i, t_i] * (w_i / W_{g_i, t_i}). Replicate-weight designs unconditionally use the cell allocator (Class A contract, PR #323). New _refresh_path_inference helper post-call refreshes safe_inference on every populated entry across multi_horizon_inference, placebo_horizon_inference, path_effects, and path_placebos so all four surfaces reflect the same final df_survey after per-path replicate fits append n_valid to the shared accumulator. Path-enumeration ranking under survey_design remains unweighted (group-cardinality, not population-weight mass). Lonely-PSU policy stays sample-wide. Telescope invariant holds bit-exactly: on a single-path panel, per-path SE matches the global non-by_path survey SE. No R parity — R did_multiplegt_dyn does not support survey weighting; this is a Python-only methodology extension. 14 new tests across two test classes: - TestByPathSurveyDesignAnalytical: gate dispatch, anti-regression on global TSL+bootstrap (locks per-path-only gate scope), per-path analytical SE, single-path telescope, replicate-weight SE, df_survey propagation, per-path placebos, trends_linear cumulated SE inheritance, unobserved-path warnings under survey. 
- TestByPathSurveyDesignTelescope: single-path telescoping invariant for analytical TSL. Documentation: REGISTRY.md "Per-path survey-design SE" sub-paragraph; by_path / paths_of_interest docstrings updated; CHANGELOG entry; docs/api/chaisemartin_dhaultfoeuille.rst and llms-full.txt updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
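
The observation-level expansion psi_i = U_pp[g_i, t_i] * (w_i / W_{g_i, t_i}) quoted above can be sketched in NumPy. This is an illustration of the formula only, not the library's `_survey_se_from_group_if` allocator; the arrays are made-up toy values, and every cell is assumed to contain at least one observation with positive weight.

```python
import numpy as np

# Per-period influence function values, shape (groups, periods). Toy numbers.
U_pp = np.array([[0.4, -0.1],
                 [0.2, 0.5]])

# One entry per observation: group index, period index, survey weight.
g = np.array([0, 0, 0, 1, 1, 1, 1])
t = np.array([0, 0, 1, 0, 1, 1, 1])
w = np.array([1.0, 3.0, 2.0, 5.0, 1.0, 1.0, 2.0])

# W[g, t] is the total weight in each (group, period) cell.
W = np.zeros_like(U_pp)
np.add.at(W, (g, t), w)

# Each observation receives its weight share of the cell's IF value.
psi = U_pp[g, t] * (w / W[g, t])

# Summing psi back within cells recovers U_pp, because shares sum to one.
cell_totals = np.zeros_like(U_pp)
np.add.at(cell_totals, (g, t), psi)
print(np.allclose(cell_totals, U_pp))
```

Since the weight shares within a cell sum to one, the allocation redistributes each cell's value across its observations without changing cell or overall totals.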

Compose by_path / paths_of_interest with survey_design (Wave 4 #10)

Closes the WooldridgeDiD (ETWFE) methodology-review-tracker promotion in METHODOLOGY_REVIEW.md (In Progress → Complete), following the primary-source review for Wooldridge (2025) merged in PR-A (#484). Adds two paper-driven implementation surfaces and extends R-parity goldens to the nonlinear paths. Implementation: - `aggregate(weights="cohort_share")` on WooldridgeDiDResults implements paper Eqs. 7.4 (simple-overall) and 7.6 (event-time, restricted to k>=0) cohort-share aggregation weights as an opt-in alternative to the default cell-count weighting (matching Stata `jwdid_estat`). Inference fields fail-closed to NaN with UserWarning per paper Section 7.5 conditional-on-shares semantics; raises on `survey_design` (design-consistent totals deferred); raises on `type ∈ {"group","calendar"}` (no paper closed-form); raises on bootstrap fits (no matching bootstrap variant). Closes TODO row 95. - `cohort_trends=True` on `WooldridgeDiD.__init__` adds linear `dg_i · t` cohort-specific trend interactions (paper Section 8 / Eq. 8.1) for the OLS path. Rejects on logit/poisson per paper Section 8 OLS scope; rejects on survey_design pending full-dummy/TSL validation; enforces per-cohort pre-period identification check (≥ 2 observed pre-periods per treated cohort). Auto-routes to full-dummy mode regardless of vcov_type. Closes the PR-A Requirements Checklist heterogeneous-trends gap. Tests: - `tests/test_methodology_wooldridge.py` extended with 6 paper-equation-numbered methodology classes (Theorem 3.1, Proposition 5.1, Section 6 event study, Section 7 aggregation paths, Section 8 heterogeneous trends, Section 10 unbalanced panels) + `TestW2025LibraryDeviations` consolidating 5 surviving deviations. Mirrors the HAD PR #473 precedent. - Two new R-parity surface classes (`TestWooldridgeParityRPoisson`, `TestWooldridgeParityRLogit`) lock the structural surface against R `etwfe(family=...)` log-link goldens. - 209 tests total (60 methodology + 149 R-parity + unit regressions). 
R Goldens: - `benchmarks/R/generate_wooldridge_golden.R` extended with Poisson + logit DGPs via R `etwfe`; augmented panel CSV retains the same seed-generated `y_pois` + `y_logit` columns for cross-language reproducibility. - `benchmarks/R/requirements.R` pins `etwfe >= 0.5.0`. Tracker promotion: - METHODOLOGY_REVIEW.md L52 status flip with merge date; detail section L583-605 rewritten to the Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns template mirroring HAD / ContinuousDiD / DCDH. L27 example re-pointed; priority queue items #7-#10 renumbered to #6-#9. - REGISTRY.md `## WooldridgeDiD (ETWFE)` extended with `### Deviations from the paper / from R / library extensions` block consolidating 7 surviving deviations + opt-in notes for cohort_share + cohort_trends + survey rejection + bootstrap cohort_share rejection contracts. - CHANGELOG.md `[Unreleased]` `### Added` documents the new parameters, R-parity extension, and tracker flip. - `docs/methodology/papers/wooldridge-2025-review.md` Requirements Checklist + Gaps & Uncertainties items 1 + 11 marked `**Status:** Closed in PR-B`. - `docs/api/wooldridge_etwfe.rst` updated with weighting-scheme notes alongside the existing aggregation table. Second of two PRs for the WooldridgeDiD methodology-review-tracker promotion. PR-A merged at e416aed (#484). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Flip the ChaisemartinDHaultfoeuille (DCDH) row from In Progress to Complete. Adds the Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns detail section mirroring the ContinuousDiD (PR igerber#476) and HAD (PR igerber#473) precedents. Consolidates 7 DCDH deviations from the paper, from R DIDmultiplegtDYN, and library extensions into a labeled REGISTRY surface per the AI-review "Documenting Deviations" convention. CHANGELOG [Unreleased] gains a new Added entry. L27 In Progress example re-pointed to WooldridgeDiD; L1289 priority-order queue item igerber#6 removed and items igerber#7-igerber#11 renumbered to igerber#6-igerber#10. No source code changes, no new tests, no new docstrings — documentation consolidation only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude added 2 commits January 3, 2026 11:51

igerber merged commit 022b721 into main Jan 3, 2026

igerber deleted the claude/add-data-prep-utilities-4Unoc branch January 3, 2026 12:50

igerber mentioned this pull request Jan 22, 2026

Explain CS vs SA pre-period discrepancy in Tutorial 02 #102

Merged

igerber added a commit that referenced this pull request May 10, 2026

Merge pull request #408 from igerber/dcdh-by-path-survey-design

8bd021d

Compose by_path / paths_of_interest with survey_design (Wave 4 #10)

igerber mentioned this pull request May 21, 2026

ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion: In Progress -> Complete #481

Merged

igerber merged 2 commits into main from claude/add-data-prep-utilities-4Unoc

2 participants
